I/O Efficient Implementation of MapReduce
Not recorded
Abstract
MapReduce is a programming model and an associated implementation used by Google for processing its massive data sets. It has a simple yet powerful interface that is amenable to a broad variety of problems. Since 2003, when the MapReduce framework was first created, more than ten thousand distinct programs have been implemented under this model. A large number of MapReduce tasks are running on Google's clusters at any given minute, processing huge amounts of data and gathering useful information. For instance, in September 2007 alone, more than 2 million MapReduce jobs were completed, processing over 400,000 TB of input data [2]. The success of MapReduce stems from the fact that it is very easy for the programmer to write a simple program and run it efficiently on a thousand machines, greatly improving the engineers' productivity. In this project (since we don't have a thousand machines), we will study the problem of how to implement the MapReduce interface efficiently on a single machine. Since the data size could be much larger than memory, the I/O-efficient techniques that you have learned in class will be useful (and required!).
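As a rough illustration of what a single-machine, I/O-efficient MapReduce might look like, the sketch below spills sorted runs of intermediate pairs to disk and then merges them. It is a minimal sketch under stated assumptions, not the project's reference design: the names `map_reduce`, `max_pairs_in_memory`, `word_count_map`, and `word_count_reduce` are hypothetical, records are serialized with `pickle` for brevity, and a single merge pass is assumed to be enough (that is, the number of runs is small enough to open all of them at once).

```python
import heapq
import itertools
import os
import pickle
import tempfile
from operator import itemgetter


def _write_run(pairs, tmpdir):
    """Sort one in-memory batch of (key, value) pairs and spill it to disk."""
    pairs.sort(key=itemgetter(0))
    fd, path = tempfile.mkstemp(dir=tmpdir, suffix=".run")
    with os.fdopen(fd, "wb") as f:
        for pair in pairs:
            pickle.dump(pair, f)
    return path


def _read_run(path):
    """Stream the (key, value) pairs of one sorted run back from disk."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def map_reduce(records, mapper, reducer, max_pairs_in_memory=100_000):
    """Hypothetical single-machine MapReduce based on external merge sort.

    mapper(record) yields (key, value) pairs; reducer(key, values) yields
    output items. At most max_pairs_in_memory intermediate pairs are held
    in memory before being written out as a sorted run.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        runs, batch = [], []
        # Map phase: buffer intermediate pairs, spilling sorted runs to disk.
        for record in records:
            for pair in mapper(record):
                batch.append(pair)
                if len(batch) >= max_pairs_in_memory:
                    runs.append(_write_run(batch, tmpdir))
                    batch = []
        if batch:
            runs.append(_write_run(batch, tmpdir))
        # Shuffle phase: one k-way merge over all runs (assumes few runs).
        merged = heapq.merge(*(_read_run(p) for p in runs), key=itemgetter(0))
        # Reduce phase: equal keys are now adjacent, so group and reduce.
        for key, group in itertools.groupby(merged, key=itemgetter(0)):
            yield from reducer(key, (value for _, value in group))


def word_count_map(line):
    """Example mapper: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield (word, 1)


def word_count_reduce(word, counts):
    """Example reducer: sum the counts for one word."""
    yield (word, sum(counts))
```

With this layout each intermediate pair is written once and read once during the merge. If the number of runs exceeded what can be merged in one pass, a multi-pass merge would be needed, which this sketch leaves out.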
Similar resources
ThemisMR: An I/O-Efficient MapReduce
“Big Data” computing increasingly utilizes the MapReduce programming model for scalable processing of large data collections. Many MapReduce jobs are I/O-bound, and so minimizing the number of I/O operations is critical to improving their performance. In this work, we present ThemisMR, a MapReduce implementation that reads and writes data records to disk exactly twice, which is the minimum amou...
Comparing Distributed Indexing: To MapReduce or Not?
Information Retrieval (IR) systems require input corpora to be indexed. The advent of terabyte-scale Web corpora has reinvigorated the need for efficient indexing. In this work, we investigate distributed indexing paradigms, in particular within the auspices of the MapReduce programming framework. In particular, we describe two indexing approaches based on the original MapReduce paper, and comp...
Sorting, Searching, and Simulation in the MapReduce Framework
In this paper, we study the MapReduce framework from an algorithmic standpoint and demonstrate the usefulness of our approach by designing and analyzing efficient MapReduce algorithms for fundamental sorting, searching, and simulation problems. This study is motivated by a goal of ultimately putting the MapReduce framework on an equal theoretical footing with the well-known PRAM and BSP paralle...
I/O Throttling and Coordination for MapReduce
As a leading framework for data intensive computing, MapReduce has gained enormous popularity in large-scale data analysis. With the increasing adoption of multi/many core platform, more and more MapReduce tasks are now running on the same node and sharing the same storage resources. The concurrency of tasks raises the issue of I/O stream congestion. We have observed significant throughput drop...
Building A Rendering System with MapReduce Framework?
3D rendering is a kind of application which is not only computation intensive but also data intensive. 3D rendering can be easily parallelized by rendering different frames at the same time and the rendering work of each frame is independent from others. Parallel 3D rendering is typically I/O bound: many rendering programs spend lots of their execution time reading data from data server rather ...